Abstract The volume of datasets is growing rapidly as every
field of work becomes digitized. Traditional clustering
algorithms become ineffective on such large datasets because
clustering them takes too long. Parallel and distributed
architectures are designed to process such large datasets, so
to cluster efficiently, traditional clustering algorithms must
be redesigned for these architectures. The few parallel
clustering algorithms that exist work on datasets that are
loaded into and accessed from main memory, which limits them
when a dataset holds millions of data objects that cannot be
loaded into memory at once. In this work, we propose a
parallel version of traditional K-means that executes on the
Hadoop distributed framework. The experimental results show
that the proposed K-means algorithm outperforms traditional
K-means when clustering large datasets.
Introduction


Data analysis has multiple approaches and facets, covering
various methods applied in fields such as scientific
experimentation, business analytics, social science and
networking [1]. Traditional data analysis techniques and
algorithms cannot analyze large datasets efficiently [2].
Large-scale data analysis is challenging because it requires
machines with high computing power, large memory and storage
systems, and fast communication among them in order to store
large datasets, perform search operations over a shared
dataset, preserve data privacy and provide an understandable
visualization [3]. Such large datasets are usually beyond the
capacity of traditional algorithms to retrieve useful
patterns, which are diverse and complex [4]. Data clustering
is a widely used data analysis tool that groups similar data
objects together such that objects in the same group are more
similar to each other than to objects in other groups
(homogeneity within a group and heterogeneity between groups)
[5, 6]. Clustering algorithms are widely used in fields that
include pattern recognition, data categorization, machine
learning, information retrieval, image analysis and
bioinformatics [7, 8].

The most popular clustering algorithms are arguably those
based on partitioning, because they are simple to understand,
easy to implement and have lower execution time complexity
than other techniques [9]. Partition-based algorithms create
k partitions of a dataset, each representing a cluster.
K-means is a popular partition-based clustering algorithm
[10, 11]. K-means takes the number of clusters k as input from
the user and randomly selects k objects from the dataset as
the initial centroids (cluster centers). Iteratively, K-means
assigns each object to its nearest centroid so that the sum of
squared distances between objects and their centroids is
minimized [12]. A new centroid is calculated for each cluster
in every iteration and fed to the next iteration as an input
centroid. K-means is efficient in terms of execution time and
cluster formation on datasets with little noise [13].

Apache Hadoop is a free software framework written in Java
for distributed storage and distributed processing of massive
datasets on computer clusters built from commodity hardware
[14, 15]. Every module in Hadoop is designed on the basic
assumption that hardware failures (of individual machines or
of racks of machines) are commonplace and should therefore be
handled automatically in software by the framework [16].
Hadoop consists of two main modules: (1) a storage part, which
is managed by the Hadoop Distributed File System (HDFS), and
(2) a processing part, which uses the MapReduce programming
paradigm. HDFS [16] provides a distributed, scalable,
fault-tolerant and portable file system written in Java for
managing input file splits and gathering results from the
different nodes of a cluster. A Hadoop cluster nominally has
one namenode (master) along with a cluster of datanodes
(slaves). The data are first uploaded to the master node,
which splits the datasets and transfers them over the network
to the slave nodes using a specific HDFS protocol [17]. All
computation in MapReduce takes its input as <key, value>
pairs, processes the data and produces its result as output
<key, value> pairs. MapReduce executes an algorithm using two
functions: Map and Reduce.

As stated before, HDFS splits the datasets and sends them to
the slave nodes. The user can specify a particular partition
function such as hash(key) MOD R, where R is the number of
partitions [16]. The Map function call from the master node is
distributed among the slave nodes. Each Map function processes
one split of data on a slave node, as distributed by HDFS.
Thus, mappers process the dataset in parallel across the slave
nodes. Mappers produce intermediate <key, value> pairs after
processing the input <key, value> pairs. These intermediate
<key, value> pairs are the input to the reducer. The reducer
combines all values associated with the same intermediate key
from the mappers. In general, one Reduce invocation produces
either zero or one output value. The intermediate pairs from
the mappers are supplied to the reducer through an iterator,
which allows it to handle lists of values that do not fit in
memory at once [18].

The following sequence of activities occurs when the user
program calls the MapReduce function (the numbered labels in
Fig. 1 correspond to the numbers in the list below):

1. The MapReduce library in the user program first divides
   the input files into M pieces of usually 16 to 64 megabytes
   (MB) per piece, which the user can control via an optional
   argument. It then starts up many copies of the program on a
   cluster of machines [19].

2. One of the copies of the program is special: the master.
   The remaining copies are workers that are allotted work by
   the master. There are M Map tasks and R Reduce tasks to
   allot. The master chooses idle workers and allots each one
   a Map task or a Reduce task [20].

3. A worker that is allotted a Map task reads the contents of
   the corresponding input split. It parses key/value pairs
   from the input data and passes each pair to the
   user-defined Map function. The intermediate key/value pairs
   produced by the Map function are buffered in memory [21].

4. Periodically, the buffered pairs are written to local disk,
   divided into R regions by the partitioning function. The
   locations of these buffered pairs on the local disk are
   communicated back to the master, which is responsible for
   forwarding these locations to the Reduce workers [21].

5. When a Reduce worker is notified by the master about these
   locations, it uses remote procedure calls to read the
   buffered data from the local disks of the Map workers. When
   a Reduce worker has read all intermediate data, it sorts
   the data by the intermediate keys so that all occurrences
   of the same key are grouped together. The sorting is
   required because typically many different keys map to the
   same Reduce task. If the amount of intermediate data is too
   large to fit in memory, an external sort is used [22].

6. The Reduce worker iterates over the sorted intermediate
   data, and for every unique intermediate key encountered, it
   passes the key and the corresponding set of intermediate
   values to the user's Reduce function. The output of the
   Reduce function is appended to a final output file [22].
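
The flow described above can be sketched in a few lines of single-process Python. This is only an illustration of the map, partition, sort and reduce stages, not Hadoop code: the names `run_mapreduce`, `partition`, `word_count_map` and `word_count_reduce` are hypothetical, word counting is chosen as a stand-in job, and CRC32 is an assumed choice for the hash in hash(key) MOD R.

```python
import zlib
from itertools import groupby


def partition(key, num_reducers):
    # hash(key) MOD R; CRC32 is an assumed hash that is stable across runs.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_mapreduce(splits, map_fn, reduce_fn, num_reducers):
    # Map phase: each split is processed independently and its output
    # is bucketed into R regions by the partition function.
    regions = [[] for _ in range(num_reducers)]
    for split in splits:
        for key, value in split:
            for out_key, out_value in map_fn(key, value):
                regions[partition(out_key, num_reducers)].append((out_key, out_value))

    # Reduce phase: sort each region by key so equal keys are adjacent,
    # then hand each key and an iterator over its values to reduce_fn.
    output = {}
    for region in regions:
        region.sort(key=lambda kv: kv[0])
        for out_key, group in groupby(region, key=lambda kv: kv[0]):
            output[out_key] = reduce_fn(out_key, (v for _, v in group))
    return output


def word_count_map(doc_name, text):
    for word in text.split():
        yield word.lower(), 1


def word_count_reduce(word, counts):
    return sum(counts)


# Two input splits, each holding (document name, text) pairs.
splits = [
    [("a.txt", "map reduce map")],
    [("b.txt", "reduce hadoop")],
]
counts = run_mapreduce(splits, word_count_map, word_count_reduce, num_reducers=2)
```

The reducer receives its values as a generator rather than a list, which mirrors the iterator-based hand-off that lets a real reducer process value lists larger than memory.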


As discussed above, Hadoop enables an algorithm to work with
thousands of nodes and terabytes of data without burdening the
user with the details of how data and computation are
allocated and distributed [23]. Hadoop and MapReduce thus
realize mass data storage, analysis and exchange management
[24]. It is therefore a good idea to adapt the traditional
K-means algorithm to the MapReduce paradigm and run it on top
of Hadoop. In this work, we have designed and executed K-means
on top of Hadoop in the MapReduce paradigm in order to cluster
large datasets efficiently. We have also compared the results
of MapReduce-based K-means with those of traditional K-means
under the same experimental setup. This work is not reported
in the previous literature cited above.

Fig. 1 Overall flow of a MapReduce operation in our
implementation


Experimental Implementation 
Datasets 


In our work, we use datasets of documents containing plain
text. A wordcounter module goes through the datasets and
creates a dictionary, which the wordmapper module uses to map
the words in our datasets. The documents are then converted to
vectors using the vector space model: each word chosen for the
dictionary is represented by its index, and each document
becomes a vector of key and value pairs. The values are tf-idf
weights [25]. These vectors are given to our program, which
creates a file of initial cluster centroids. The vector file
and the centroid file are supplied to the Hadoop system as the
input dataset for K-means clustering. We use the same datasets
in both experiments, sequential K-means and parallel K-means.


Vector Space Model 


The vector space model was proposed by Salton et al. [26]. The
dataset, in the form of plain text files, is given to a
Process Corpus module and converted to vectors using tf-idf
weights [27, 28]. A vector is a file that stores, for each
word, a key (the index of the word) and a value (a floating
point number). These values are generated by multiplying the
index by the size of the word, taking the logarithm and
normalizing the result. The output is a plain text file that
is later used for building clusters. The term-specific weights
in the document vectors are products of local and global
parameters. The model is known as the term frequency-inverse
document frequency (tf-idf) model. The weight vector for
document d is $v_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T$,
where

$$ w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{|D|}{|\{d' \in D \mid t \in d'\}|} \tag{1} $$

and

• $\mathrm{tf}_{t,d}$ is the term frequency of term t in
  document d (a local parameter).

• $\log \frac{|D|}{|\{d' \in D \mid t \in d'\}|}$ is the
  inverse document frequency (a global parameter). $|D|$ is
  the total number of documents in the document set;
  $|\{d' \in D \mid t \in d'\}|$ is the number of documents
  containing the term t.
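
As a small worked instance of Eq. (1), the sketch below computes tf-idf weights for a toy corpus. It assumes raw term counts for tf and the natural logarithm, and it omits the normalization step mentioned above; the function name `tf_idf` and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents maps a name to its token list; returns name -> {term: weight}."""
    num_docs = len(documents)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in documents.values():
        doc_freq.update(set(tokens))

    weights = {}
    for name, tokens in documents.items():
        tf = Counter(tokens)
        # w_{t,d} = tf_{t,d} * log(|D| / df_t), following Eq. (1).
        weights[name] = {t: tf[t] * math.log(num_docs / doc_freq[t]) for t in tf}
    return weights


docs = {"d1": ["fork", "child", "child"], "d2": ["fork", "parent"]}
w = tf_idf(docs)
```

A term that occurs in every document, such as `fork` here, gets weight 0 because log(|D|/|D|) = 0, while `child` in `d1` gets 2 · log 2.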


Using the cosine, the similarity between document $d_j$ and
query q can be calculated as:

$$ \mathrm{sim}(d_j, q) = \frac{d_j \cdot q}{\|d_j\| \, \|q\|} = \frac{\sum_{i=1}^{N} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{N} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{N} w_{i,q}^2}} \tag{2} $$

The vector space model has the following advantages over other
models:

• Simple model based on linear algebra.

• Term weights are not binary.

• Allows computing a continuous degree of similarity between
  documents and queries.

• Allows ranking documents with respect to their possible
  relevance.

• Allows partial matching.
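
The cosine similarity of Eq. (2) can be sketched for sparse vectors as follows. Representing a vector as a dictionary from term index to weight is an assumption made for illustration, as is returning 0.0 when either vector is all zeros.

```python
import math


def cosine_similarity(doc, query):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(weight * query[term] for term, weight in doc.items() if term in query)
    norm_doc = math.sqrt(sum(weight * weight for weight in doc.values()))
    norm_query = math.sqrt(sum(weight * weight for weight in query.values()))
    if norm_doc == 0 or norm_query == 0:
        return 0.0  # assumed convention for an empty or all-zero vector
    return dot / (norm_doc * norm_query)
```

For instance, `cosine_similarity({1: 1.0, 2: 1.0}, {1: 1.0})` is 1/√2, which illustrates partial matching: the query shares only one of the document's two terms.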
Generating Initial Cluster Centroids


The vector file is used to compute the initial set of
centroids from the data. The vectors and the number of
clusters are given as input, and the initial cluster centroids
are obtained by random sampling from the vector file.
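
A minimal sketch of this sampling step is shown below. The in-memory dictionary of vectors, the function name `initial_centroids` and the optional `seed` parameter are assumptions for illustration; the actual program reads the vectors from a file.

```python
import random


def initial_centroids(vectors, k, seed=None):
    """Pick k distinct document vectors at random as the starting centroids."""
    rng = random.Random(seed)
    # Sorting the names makes the draw reproducible for a fixed seed.
    chosen = rng.sample(sorted(vectors), k)
    return [dict(vectors[name]) for name in chosen]


vectors = {"a": {1: 1.0}, "b": {2: 1.0}, "c": {3: 1.0}}
centroids = initial_centroids(vectors, k=2, seed=0)
```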


Euclidean Distance 


Given two documents $d_a$ and $d_b$ represented by their term
vectors $\vec{t_a}$ and $\vec{t_b}$, respectively, the
Euclidean distance is given simply as:

$$ D_E(\vec{t_a}, \vec{t_b}) = \left( \sum_{t=1}^{m} \left| w_{t,a} - w_{t,b} \right|^2 \right)^{1/2} \tag{3} $$

where the term set is $T = \{t_1, \ldots, t_m\}$. As mentioned
previously, we use the tf-idf value as term weights, that is,
$w_{t,a} = \mathrm{tfidf}(d_a, t)$.
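
For sparse tf-idf vectors, Eq. (3) can be evaluated over the union of the terms that occur in either document. The sketch below assumes vectors stored as dictionaries from term index to weight, with missing terms treated as weight 0.

```python
import math


def euclidean_distance(vec_a, vec_b):
    """Euclidean distance between two sparse vectors given as {term: weight}."""
    terms = set(vec_a) | set(vec_b)
    # A term missing from one vector is treated as having weight 0.
    return math.sqrt(
        sum((vec_a.get(t, 0.0) - vec_b.get(t, 0.0)) ** 2 for t in terms)
    )
```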


Parallel K-Means Algorithm 


The generated cluster file and the vector file, along with the
number of iterations, are given as arguments to the Hadoop
execution. The cluster file and the data file are saved on
HDFS so that Hadoop can access them. The program defines the
minimum and maximum input split size. The input split size
determines how the data are divided into the chunks that are
given to the datanodes. The larger the block size, the larger
the input split size should be. If the split size is very
small, performance is at its worst and the execution time can
exceed that of the sequential version. In the following
sections, we compare the execution of our sequential K-means
code with the parallel implementation using different input
split sizes.
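
To see why a very small split size hurts, it helps to count the Map tasks it produces. The sketch below assumes the rule used by Hadoop's `FileInputFormat`, where the split size is max(minSize, min(maxSize, blockSize)), and it approximates the number of splits by a simple ceiling division (the real implementation lets the last split run slightly over). The file and block sizes are made-up example values.

```python
import math

MB = 1024 * 1024


def compute_split_size(block_size, min_size, max_size):
    # Assumed rule: clamp the block size between the configured min and max.
    return max(min_size, min(max_size, block_size))


def num_splits(file_size, split_size):
    # Approximation: one Map task per split, last split may be partial.
    return math.ceil(file_size / split_size)


# Hypothetical 1 GB input on a file system with 128 MB blocks.
file_size = 1024 * MB
default_split = compute_split_size(128 * MB, min_size=1, max_size=2**63 - 1)
small_split = compute_split_size(128 * MB, min_size=1, max_size=1 * MB)
```

With these example numbers, the default split size yields 8 Map tasks, while capping the split at 1 MB yields 1024 Map tasks, each paying its own start-up and scheduling overhead.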


Pseudo-codes 

Vector Space Model 

• Step 1: Tokenization: A <key, value> pair is input to the
  Map function, where the key is the name of a document and
  the value is the text of the document. The Reduce function
  also outputs pairs, where the key is the name of a document
  and the values are the tokens (words) in that document.
  Ex: Key: /file1.txt : Value: [child, process, created, fork,
  share, ordinary, variables, parent, process, parent, write,
  file, descriptor, child, process]

• Step 2: Dictionary file: Each token in all documents is
  assigned a unique number. The dictionary is generated by
  providing <document name, wordlist> as input to the Map
  function, and the Reduce function outputs <word, uniqueid>.
  Ex: Key: file1.txt : Value: 231.

• Step 3: Frequency count: The frequency is the total number
  of times a word appears globally in all documents. The
  output values of the Map function are accumulated, and the
  Reduce function computes their sum. The output format of the
  Reduce function is as follows: Ex: Key: 50 : Value: 2

• Step 4: Calculate term frequency: The input to MapReduce in
  this step is <document name, wordlist>, and it counts the
  occurrences of each word or term ti in that document. This
  stage outputs the document dataset as <document name,
  {ti : count}>. Here file1.txt is the name of the document
  and the values are listed in the format (wordid : count).
  For example: Key: /file1.txt : Value: {3279:7.0, 432:1.0,
  324:3.0, ...}

• Step 5: Calculate term frequency-inverse document frequency
  (tf-idf) value: The Map function in this step takes the
  output of step 4 as input. It calculates the weight as
  tf × idf for each term ti of each document dj, as given in
  Eq. (1). It produces the result in the form <document name,
  {t1 : tf-idf1, t2 : tf-idf2, ..., ti : tf-idfi}>. For
  example: Key: file1.txt : Value: {3279:0.12728, 432:0.12728,
  324:0.08060, ...}
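
Steps 1, 2 and 4 above can be sketched in plain Python to show the data shapes that pass between stages. This is a single-process illustration, not the MapReduce jobs themselves: the function names, the regular-expression tokenizer and the sample sentence are assumptions, and no stop-word removal is performed.

```python
import re
from collections import Counter


def tokenize(text):
    # Step 1: lowercase the text and keep alphabetic runs as tokens.
    return re.findall(r"[a-z]+", text.lower())


def build_dictionary(token_lists):
    # Step 2: assign each distinct token a unique integer id.
    dictionary = {}
    for tokens in token_lists.values():
        for token in tokens:
            dictionary.setdefault(token, len(dictionary))
    return dictionary


def term_frequencies(tokens, dictionary):
    # Step 4: count occurrences per document, keyed by word id.
    return dict(Counter(dictionary[token] for token in tokens))


corpus = {"/file1.txt": "Child process created; parent process writes."}
tokens = {name: tokenize(text) for name, text in corpus.items()}
dictionary = build_dictionary(tokens)
tf = {name: term_frequencies(toks, dictionary) for name, toks in tokens.items()}
```

Here `tf["/file1.txt"]` has the same (wordid : count) shape as the Step 4 output, and it is the input that a tf-idf stage such as Step 5 would consume.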


The MapReduce-Based K-Means Algorithm 


The MapReduce K-means takes two inputs: the number of clusters
(K) to be formed and the document dataset (D). It first
creates K centroids by randomly choosing K documents from the
document dataset. Using the Euclidean distance, it calculates
the distance between each document and each centroid, and
assigns each document to the centroid at the smallest
distance. It then updates the centroid of each cluster. The
algorithm converges when the cluster centers do not change
between two consecutive iterations.
Algorithm 1 K-Means Algorithm

Input: K: cluster number, D: Top N documents
Output: K clusters of input documents dataset

Randomly choose K documents from D then produce K centroids C1, C2, ..., CK
Repeat until there is no change in cluster centers between two consecutive iterations:
    for each document di in D do
        for j = 1 to K
            calculate the distance between the document di and centroid Cj using Euclidean distance measure
        end for
        calculate the minimum distance between the centroids and the document
    end for
    For each cluster update centroid
end loop
end K-Means
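
A compact sequential version of Algorithm 1 is sketched below for dense vectors. It is a minimal illustration rather than the implementation evaluated in this work: the `max_iter` cap, the fixed `seed`, and keeping the old centroid when a cluster becomes empty are assumptions added to make the example self-contained.

```python
import math
import random


def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, max_iter=100, seed=0):
    # Randomly choose k points as the initial centroids.
    centroids = random.Random(seed).sample(points, k)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: distance(p, centroids[j]))
            clusters[nearest].append(p)

        # Update step: each centroid becomes the mean of its cluster.
        new_centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]

        # Converged when the centers do not change between iterations.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, clusters = kmeans(points, k=2)
```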


K-Means Mapper 


It takes two inputs: the set of initial centroids C and the
set of objects X (documents). For each document, the best
centroid is initialized to null and the minimum distance to
infinity. Then, for each centroid, the algorithm calculates
the distance between the centroid and the document. If the
distance from centroid c is below the current minimum
distance, minDist is set to dist and bestCentroid is set to c.
These steps are repeated for every centroid and every document
to find the best centroid and the minimum distance. The
algorithm yields pairs of centroids and documents.


K-Means Reducer


The algorithm takes (key, value) pairs as input. The key is
the bestCentroid, and the value denotes the objects assigned
to that centroid by the mapper. For each bestCentroid, the
algorithm sums the documents assigned to it and counts them,
then divides the sum by the count to obtain the new centroid.
Finally, it yields pairs of old centroids and the
corresponding new centroids.


Algorithm 2 Procedure: Mapper function of K-Means

Input: set of objects X = {x1, x2, ..., xn}, initial centroids set C = {c1, c2, ..., ck}
Output: A list containing pairs <cj, xi>, where 1 ≤ i ≤ n and 1 ≤ j ≤ k

M1 ← {x1, x2, ..., xm}
current_centroids ← C
distance(p, q) = sqrt(Σi (pi − qi)²), where pi (or qi) is the coordinate of p (or q) in dimension i
for all xi ∈ M1 such that 1 ≤ i ≤ m do
    bestCentroid ← null
    minDist ← ∞
    for all c ∈ current_centroids do
        dist ← distance(xi, c)
        if (bestCentroid = null || dist < minDist) then
            minDist ← dist
            bestCentroid ← c
        end if
    end for
    emit (bestCentroid, xi)
    i += 1
end for
return outputlist
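
The mapper of Algorithm 2 can be sketched as a Python generator over one input split. One deliberate difference is labeled here as an assumption: the emitted key is the index of the nearest centroid rather than the centroid vector itself, since an integer is a convenient grouping key. The function names are illustrative.

```python
import math


def distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans_map(split, centroids):
    """Yield (index of nearest centroid, object) for every object in one split."""
    for x in split:
        best_index, min_dist = None, math.inf
        for index, c in enumerate(centroids):
            dist = distance(x, c)
            if best_index is None or dist < min_dist:
                best_index, min_dist = index, dist
        # Assumption: emit the centroid's index instead of the centroid vector.
        yield best_index, x


centroids = [(0.0, 0.0), (10.0, 10.0)]
split = [(0.0, 1.0), (9.0, 9.0)]
pairs = list(kmeans_map(split, centroids))
```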
Algorithm 3 Procedure: Reducer function of K-Means

Input: Pairs of (Key, Value) assigned to the centroid by the mapper, where Key = bestCentroid and Value = Objects
Output: Pairs of (Key, Value) where Key = oldCentroid and Value = newBestCentroid. newBestCentroid is the new centroid value calculated and derived for bestCentroid

outputlist ← outputlist from mappers
v ← {}
newCentroidList ← null
for all B ∈ outputlist do
    centroid ← B.key
    object ← B.value
    v[centroid] ← v[centroid] ∪ {object}
end for
for all centroid ∈ v do
    newCentroid, sumofObjects, numofObjects
    sumofObjects ← 0
    numofObjects ← 0
    for all object ∈ v[centroid] do
        sumofObjects += object
        numofObjects += 1
    end for
    newCentroid ← (sumofObjects ÷ numofObjects)
    emit (centroid, newCentroid)
end for
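
The reducer of Algorithm 3 can be sketched in the same style. The sketch assumes the mapper output is a sequence of (centroid index, object) pairs, where the index refers to a position in the current centroid list; that keying convention and the function name `kmeans_reduce` are assumptions for illustration.

```python
from collections import defaultdict


def kmeans_reduce(pairs, centroids):
    """Yield (old centroid, new centroid) for every centroid that received objects."""
    # Group the objects by the centroid they were assigned to.
    grouped = defaultdict(list)
    for index, obj in pairs:
        grouped[index].append(obj)

    for index in sorted(grouped):
        objects = grouped[index]
        # Sum the objects dimension by dimension, then divide by their count.
        sum_of_objects = [sum(dim) for dim in zip(*objects)]
        new_centroid = tuple(s / len(objects) for s in sum_of_objects)
        yield centroids[index], new_centroid


centroids = [(0.0, 0.0), (10.0, 10.0)]
pairs = [(0, (0.0, 1.0)), (0, (2.0, 1.0)), (1, (9.0, 9.0))]
updated = list(kmeans_reduce(pairs, centroids))
```

One full K-means iteration is a mapper pass over every split followed by this reducer; the new centroids it emits become the input centroids of the next iteration.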


Table 1 Performance for k = 10 (execution time in seconds for a single iteration)


No. of documents   Sequential K-means   1 node   2 nodes   3 nodes
2000               15                   15       13        14
4000               18                   16       15        15
6000               20                   18       16        17
8000               21                   18       17        18
10,000             22                   19       18        18
20,000             27                   24       22        21


Experimental Results 
Experimental Setup 


The proposed work is carried out on a Hadoop cluster of four
nodes. There are three dedicated slave nodes, and the master
node is also configured as a slave, so the cluster used for
experimentation consists of one master and four slaves. In the
following sections, a three-node Hadoop cluster means that
three slaves are used together with one master that is also
configured with slave functionality. Each node is built from
commodity hardware with the same configuration: an Intel i3
processor with four cores at 3.07 GHz, 1.5 GB of RAM and a
500 GB hard disk. The nodes are connected by a 100-Mbps
Ethernet LAN. The cluster runs Hadoop version 2.7.2 on the
Ubuntu 14.04 LTS operating system. The dataset used for the
experiments is the 20 newsgroups dataset, which comprises 20
folders for different sections of news such as religion, games
and politics. Each folder contains approximately 1000
documents, giving a dataset of about 20,000 documents. Each
document is a collection of newsgroup articles in plain text
format. Performance is measured starting with 2000 documents
and increasing the count in steps of 2000 up to 20,000
documents. We also vary the K value (the number of initial
centroids input to K-means) from 10 to 30. We compare the
performance of sequential K-means, a 1-node
(pseudo-distributed) Hadoop cluster and a multi-node Hadoop
cluster of up to three nodes.


Experimental Results and Evaluation 
Dataset with 10 Clusters 


We have experimented with 2000 to 20,000 documents in
increments of 2000 documents. Table 1 compares sequential
K-means with Hadoop-based K-means on one, two and three nodes
for k = 10, and Fig. 2 plots the same results. Sequential
K-means is slower than Hadoop-based K-means, and the gap
widens for larger datasets: at 20,000 documents, a single
iteration takes 27 s sequentially and 21 s on three nodes. As
the number of nodes increases, performance generally improves,
as shown in Fig. 2 and Table 1. All times are measured in
seconds for a single iteration.
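
The size of the improvement can be quantified as a speedup, that is, the sequential time divided by the parallel time. The sketch below copies the times from Table 1; the `speedup` helper and the dictionary layout are introduced here only for illustration.

```python
# Times in seconds per iteration from Table 1 (k = 10):
# (sequential, 1 node, 2 nodes, 3 nodes)
table_1 = {
    2000: (15, 15, 13, 14),
    4000: (18, 16, 15, 15),
    6000: (20, 18, 16, 17),
    8000: (21, 18, 17, 18),
    10000: (22, 19, 18, 18),
    20000: (27, 24, 22, 21),
}


def speedup(row, nodes):
    """Sequential time divided by the time on the given number of nodes."""
    return row[0] / row[nodes]


speedup_small = speedup(table_1[2000], nodes=3)
speedup_large = speedup(table_1[20000], nodes=3)
```

On three nodes the speedup grows from about 1.07 at 2000 documents to about 1.29 at 20,000 documents, which is consistent with the observation that the benefit increases with dataset size.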


Dataset with 20 Clusters 


Here, we set k to 20 and ran both sequential K-means and
Hadoop-based K-means on multiple nodes, measuring the time in
seconds for a single iteration. MapReduce improves the
performance of K-means, and performance improves further as
nodes are added. Figure 3 shows that the performance
improvement grows as the dataset size increases.

Fig. 2 Performance for k = 10

Fig. 3 Performance for k = 20


Dataset with 30 Clusters


Here, we set k to 30 and ran both sequential K-means and
Hadoop-based K-means on multiple nodes, measuring the time in
seconds for a single iteration. MapReduce improves the
performance of K-means, and performance improves further as
nodes are added, as shown in Fig. 4. The performance
improvement again grows as the dataset size increases.

Fig. 4 Performance for k = 30 (execution time in seconds for a single iteration)

No. of documents   2000   4000   6000   8000   10,000   20,000
Seq                51     58     66     68     7?       110
1 node             30     39     42     49     55       94
2 nodes            27     33     36     41     48       66
3 nodes            24     29     32     33     36       39


Conclusion


We have run sequential K-means and a parallel implementation
of K-means on the Hadoop framework and compared their
execution times for different values of k and different
numbers of documents. We observed that sequential K-means is
considerably less efficient than parallel K-means. It is also
observed that keeping the input split size equal to the block
size improves performance further, and that the input split
size and the HDFS block size play a major role in how a Hadoop
execution uses network bandwidth and system memory. The larger
the dataset, the more efficient the parallel K-means is. From
the graphs plotted during performance analysis, we can say
that parallel K-means based on MapReduce is more efficient
than sequential K-means on larger datasets.


Future Enhancements 


Further improvements can be made to increase the efficiency of
parallel K-means based on MapReduce. First, along with the
mapper and reducer modules, we can use a combiner module to
reduce the overhead on the reducer. Second, we can increase
the block size so that there are fewer Map splits. Third, we
can implement another version of K-means called constrained
K-means, which incorporates background information about the
dataset along with the data.
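
The combiner idea can be sketched as follows. This is one possible design, not part of the implementation reported here: each mapper's output is collapsed locally into a partial (sum, count) per centroid, so the reducer only merges a few partials instead of receiving every document. The function names and the use of centroid indices as keys are assumptions.

```python
def kmeans_combine(pairs):
    """Collapse one mapper's (centroid index, object) pairs into partial sums."""
    partial = {}
    for index, obj in pairs:
        if index in partial:
            total, count = partial[index]
            partial[index] = ([t + x for t, x in zip(total, obj)], count + 1)
        else:
            partial[index] = (list(obj), 1)
    for index, (total, count) in partial.items():
        yield index, (total, count)


def kmeans_reduce_partials(partials):
    """Merge (sum, count) partials from all combiners into new centroids."""
    merged = {}
    for index, (total, count) in partials:
        if index in merged:
            old_total, old_count = merged[index]
            merged[index] = ([a + b for a, b in zip(old_total, total)], old_count + count)
        else:
            merged[index] = (list(total), count)
    return {index: [s / count for s in total] for index, (total, count) in merged.items()}


# Output of two hypothetical mappers.
mapper_a = [(0, (0.0, 0.0)), (0, (2.0, 2.0)), (1, (10.0, 10.0))]
mapper_b = [(0, (4.0, 4.0))]
partials = list(kmeans_combine(mapper_a)) + list(kmeans_combine(mapper_b))
new_centroids = kmeans_reduce_partials(partials)
```

Carrying the count alongside the sum matters: averaging the per-mapper means directly would give the wrong centroid whenever mappers contribute different numbers of documents to a cluster.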